Web-scale Topic Models in Spark: An Asynchronous Parameter Server

Authors

  • Rolf Jagerman
  • Carsten Eickhoff
Abstract

In this paper, we train a Latent Dirichlet Allocation (LDA) topic model on the ClueWeb12 data set, a 27-terabyte Web crawl. We extend Spark, a popular tool for large-scale data analysis, with an asynchronous parameter server. Such a parameter server provides a distributed, concurrently accessed parameter space for the model. A Metropolis-Hastings-based collapsed Gibbs sampler is implemented on top of this parameter server, achieving amortized O(1) sampling complexity. We compare our implementation to the default Spark implementations and show that it is several orders of magnitude more scalable without sacrificing model quality. A topic model with 1,000 topics is trained on the full ClueWeb12 data set, uncovering some of the prevalent themes that appear on the Web.
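The abstract does not say how the amortized O(1) sampling cost is obtained. One common building block in Metropolis-Hastings-based LDA samplers is Walker's alias table: building the table costs O(K) for K topics, but each later draw costs O(1), so the construction cost is amortized when the table is reused for many draws. The sketch below illustrates only that building block, under the assumption that a fixed proposal distribution is reused; the function names `build_alias` and `draw` are hypothetical and not taken from the paper.

```python
import random


def build_alias(weights):
    """Build Walker alias tables for non-negative weights in O(K) time."""
    k = len(weights)
    total = float(sum(weights))
    # Rescale so that the average bucket mass is exactly 1.
    scaled = [w * k / total for w in weights]
    prob = [0.0] * k
    alias = [0] * k
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        # Bucket s keeps its own mass and borrows the rest from l.
        prob[s] = scaled[s]
        alias[s] = l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    # Leftover buckets are full up to floating-point error.
    for i in small + large:
        prob[i] = 1.0
        alias[i] = i
    return prob, alias


def draw(prob, alias, rng):
    """Draw one index in O(1): pick a bucket, then keep it or take its alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]


if __name__ == "__main__":
    rng = random.Random(0)
    prob, alias = build_alias([0.5, 0.3, 0.2])
    print([draw(prob, alias, rng) for _ in range(10)])
```

In a Metropolis-Hastings sampler, draws from such a table would serve as proposals that an acceptance step then corrects, which is why a slightly stale table can still be reused across many tokens.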

Similar resources

An Efficient Threading Model to Boost Server Performance

Multi-threading remains a popular choice for server architecture. Widely used applications like the Apache web server and the MySQL database server are written in a multi-threaded fashion. We consider thread architectures from two angles: (1) number of user threads per kernel thread, and (2) use of synchronous I/O vs. asynchronous I/O, and consider their effects on server performance. Our clai...

How Data Volume Affects Spark Based Data Analytics on a Scale-up Server

The sheer increase in the volume of data over the last decade has triggered research in cluster computing frameworks that enable web enterprises to extract big insights from big data. While Apache Spark is gaining popularity for exhibiting superior scale-out performance on commodity machines, the impact of data volume on the performance of Spark based data analytics in scale-up configuration is not...

LightLDA: Big Topic Models on Modest Compute Clusters

When building large-scale machine learning (ML) programs, such as massive topic models or deep networks with up to trillions of parameters and training examples, one usually assumes that such massive tasks can only be attempted with industrial-sized clusters with thousands of nodes, which are out of reach for most practitioners or academic researchers. We consider this challenge in the context...

Investigation on Reliability Estimation of Loosely Coupled Software as a Service Execution Using Clustered and Non-Clustered Web Server

Evaluating the reliability of loosely coupled Software as a Service through the paradigm of a cluster-based and non-cluster-based web server is considered to be an important attribute for the service delivery and execution. We proposed a novel method for measuring the reliability of Software as a Service execution through load testing. The fault count of the model against the stresses of users ...

Consistent Bounded-Asynchronous Parameter Servers for Distributed ML

In distributed ML applications, shared parameters are usually replicated among computing nodes to minimize network overhead. Therefore, a proper consistency model must be carefully chosen to ensure the algorithm's correctness and provide high throughput. Existing consistency models used in general-purpose databases and modern distributed ML systems are either too loose to guarantee correctness of the ...

Journal:
  • CoRR

Volume: abs/1605.07422

Publication date: 2016